Introducing Voice Agent Simulations

When your agents get complex

Surface patterns in production, turn them into scenarios, and improve quality with every release. For teams that can't afford to get it wrong.

Trusted in production by
Backbase, PagBank, Visma, Deloitte, Altura, Vinny, Freeday
Why LangWatch

Built for where agent quality gets hard.

Two problems every team hits as their agents grow. And how we solve them.

Problem 01

A single eval can’t keep up with a complex agent.

Every team starts with evals. But when an agent uses five tools across a ten-turn conversation, one score on the final answer doesn’t tell you much. You need to see whether the right tool fired, and where the conversation broke.

LangWatch is the only platform that runs simulations and evals side by side.

Problem 02

Engineers shouldn’t own quality alone.

Engineers build the stack, the pipelines, the first eval set. From there, the best evals come from whoever knows the user best. That’s often a PM or domain expert, not the engineer.

Write scenarios with product. Hand the eval suite over sooner.

The platform

One platform, four pillars.

Agent testing, evals, traces and governance. Open by default, OpenTelemetry-native, runs against any model.


Agent testing

Test agents end-to-end with multi-turn simulations. A user simulator drives real conversations, a judge scores every turn, and you catch the failures single-shot evals miss.

  • Multi-turn simulations of real users
  • Per-turn judge with pass/fail criteria
  • Powered by Scenario, MIT-licensed OSS
  • Runs locally or in CI
[ Explore Scenario ]
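
The loop described above (a simulated user drives the conversation, a judge scores every turn) can be sketched in a few lines. This is a minimal illustration and not the Scenario API: `simulate`, `toy_agent`, `judge`, and the scripted user messages are all hypothetical names, and a real user simulator and judge would typically be LLM-driven rather than hard-coded.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# All names below are hypothetical; this is not the Scenario API.
History = List[Tuple[str, str]]  # (role, message) pairs


@dataclass
class Turn:
    user: str
    agent: str
    passed: bool


@dataclass
class SimulationResult:
    turns: List[Turn] = field(default_factory=list)

    @property
    def success(self) -> bool:
        # The run passes only if every turn passed its criteria.
        return all(t.passed for t in self.turns)

    @property
    def first_failed_turn(self) -> Optional[int]:
        # Zero-based index of the first failing turn, or None.
        for i, t in enumerate(self.turns):
            if not t.passed:
                return i
        return None


def simulate(
    agent: Callable[[History], str],
    user_script: List[str],
    judge: Callable[[str, str], bool],
) -> SimulationResult:
    """Play scripted user messages against the agent and judge each turn."""
    result = SimulationResult()
    history: History = []
    for user_msg in user_script:
        history.append(("user", user_msg))
        reply = agent(history)
        history.append(("agent", reply))
        result.turns.append(Turn(user_msg, reply, judge(user_msg, reply)))
    return result


def toy_agent(history: History) -> str:
    # Deliberately buggy: it ignores a request for a human hand-off.
    if "human" in history[-1][1].lower():
        return "Let's continue with the next question."
    return "Thanks. Could you tell me more?"


def judge(user_msg: str, reply: str) -> bool:
    # Criterion: a request for a human must be acknowledged in the reply.
    if "human" in user_msg.lower():
        return "human" in reply.lower()
    return True


script = ["I led an eval tooling project.", "Can I speak to a human?", "Okay, go on."]
result = simulate(toy_agent, script, judge)
print(result.success, result.first_failed_turn)  # False 1
```

A single check on the final turn would pass here; the per-turn verdicts are what locate the failure at the second user message.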
Simulation: qualified senior candidate

Hello, and thank you for joining the interview. I am an AI assistant conducting this interview — the conversation may be recorded and assessed, and you can request a human at any time. Let's start: could you tell me about a recent project where you led the development of an LLM evaluation tool?

Open by default

OpenTelemetry-native, MIT-licensed, runs against any model.

Models
OpenAI, Anthropic, Azure, AWS Bedrock, Google Vertex AI, Groq, Ollama
OpenAI- + Anthropic- + OTLP-compatible. Drop in, no rewrites.
Frameworks
LangChain, LangGraph, Vercel AI SDK, Mastra, CrewAI, Google ADK, LangFlow, Flowise, n8n
Works with the frameworks your team already uses.
Langy

Our AI tests your AI.

Langy turns a PM's goal into a full Scenario test plan — then turns the failures into pull requests.

PMs own the spec. Devs stay in flow. Nothing slips through.

  1. PM writes the goal (no code): Plain English. No code, no YAML; the brief is the spec.
  2. Langy drafts the plan (live): Picks the simulator, generates the scenarios, writes the JudgeAgent rubric.
  3. Scenario runs in parallel: Multi-turn conversations against your agent, concurrent across projects.
  4. JudgeAgent scores it (signed): Your rubric, audited: faithfulness, policy adherence, de-escalation.
  5. Regressions become PRs (ready to ship): Langy drafts the prompt revision. Devs review and ship via Prompt Registry.
Median PM-to-PR: 14 minutes.
Scenario

A thousand conversations before a single user.

LangWatch's open-source agent testing framework. Drop it into any agent in Python, TypeScript, or Go. Drive multi-turn flows yourself, or let UserSimulatorAgent play the user. JudgeAgent scores every turn. RedTeamAgent finds the edges.

  • Framework agnostic: Wraps any agent, whether LangGraph, CrewAI, Mastra, or plain code.
  • Concurrent at scale: Per-call isolation (ADR-001); batch across projects.
  • Text, voice, adversarial: Same suite, any modality. Voice loops + Crescendo built in.
  • MIT licensed
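
As a rough picture of what per-call isolation buys when batching, the sketch below runs several simulated conversations concurrently with `asyncio`, each holding its own history. It is an assumption-laden illustration, not Scenario's implementation: the contents of ADR-001 are not given here, and `echo_agent`, `run_conversation`, and `run_batch` are hypothetical names.

```python
import asyncio
from typing import List


# Hypothetical stand-in for an agent under test; not part of Scenario.
async def echo_agent(history: List[str], message: str) -> str:
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    history.append(message)
    return f"turn {len(history)}: {message}"


async def run_conversation(conv_id: int, turns: int) -> List[str]:
    # State is created inside the call, so concurrent runs never share it.
    history: List[str] = []
    replies: List[str] = []
    for t in range(turns):
        replies.append(await echo_agent(history, f"conv {conv_id} msg {t}"))
    return replies


async def run_batch(n_conversations: int, turns: int) -> List[List[str]]:
    # gather() interleaves the conversations and returns results in input order.
    tasks = (run_conversation(i, turns) for i in range(n_conversations))
    return list(await asyncio.gather(*tasks))


results = asyncio.run(run_batch(n_conversations=5, turns=3))
print(results[2][-1])  # turn 3: conv 2 msg 2
```

Because each conversation builds its own `history`, the turn counter in every reply reflects only that conversation, even though all five interleave on one event loop.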

Controls your security team will sign off on.

Production AI shouldn't ship without RBAC, audit trails, cost attribution and a key-revocation story. LangWatch makes those a first-class pillar — not a roadmap promise.
  • RBAC + REST APIs: Teams, projects, API keys, all scoped at the role level.
  • SCIM + SSO: Okta / Azure AD / Google. Group → team auto-assignment.
  • Cost-center attribution: Spend tracked across members, teams, projects.
  • Audit log → SIEM: Every prompt change, eval edit, and key event is signed.
audit · workspace enron-prod · 24h · 3,144 events · 0 anomalies

Actor | Action | Target | Time
skeeter@enron.com | created api_key | prod-readonly · cost-center: shred-quarterly | 14:21:02
scim:gandalf | assigned role | maintainer · team Mordor/EU | 14:18:44
hank@globex.com | promoted prompt | support-triage v12.1.0 → 100% traffic | 14:09:11
feature-flag-svc | flipped flag | evals.judge_v2 → prod · rollout 25% | 13:52:30
felicia@taylorswift.inc | revoked api_key | staging-token-3219 · "shake it off" | 13:40:18
patrick@pierce.com | updated evaluator | PII-guard · regex pattern | 13:11:55

signed · exportable · Stream to SIEM →
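
One common way to make audit entries tamper-evident is to attach a keyed hash to each event. The sketch below uses HMAC-SHA256 from the Python standard library. It is only an illustration of the idea: the page does not say how LangWatch signs events, and the key, the field names, and the `sign_event` / `verify_event` helpers are assumptions.

```python
import hashlib
import hmac
import json

# Hypothetical key and event shape; the real signing scheme is not described here.
SIGNING_KEY = b"demo-key-do-not-use"


def _canonical(event: dict) -> bytes:
    # Stable serialization so the same event always hashes to the same bytes.
    return json.dumps(event, sort_keys=True, separators=(",", ":")).encode()


def sign_event(event: dict, key: bytes = SIGNING_KEY) -> dict:
    """Return a copy of the event with an HMAC-SHA256 signature attached."""
    signature = hmac.new(key, _canonical(event), hashlib.sha256).hexdigest()
    return {**event, "signature": signature}


def verify_event(signed: dict, key: bytes = SIGNING_KEY) -> bool:
    """Recompute the signature over everything except the signature field."""
    event = {k: v for k, v in signed.items() if k != "signature"}
    expected = hmac.new(key, _canonical(event), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed.get("signature", ""))


entry = sign_event({
    "actor": "patrick@pierce.com",
    "action": "updated evaluator",
    "target": "PII-guard",
    "time": "13:11:55",
})
tampered = {**entry, "target": "something-else"}

print(verify_event(entry), verify_event(tampered))  # True False
```

A downstream SIEM holding the same key could run the verification step on each exported entry and flag any that fail.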

One endpoint. Every provider.

A drop-in proxy that speaks the OpenAI- and Anthropic-compatible APIs. Set a base URL, keep your existing SDK. You get automatic provider fallback, per-team budgets, and Anthropic cache_control passthrough, and every request lands as a LangWatch trace.
  • OpenAI-compatible: No SDK changes. Point base_url at the gateway.
  • Provider fallback: Automatic spillover on rate limit, error or latency.
  • Per-team budgets: Cost-center aware. Block or alert before spend.
  • cache_control passthrough: Anthropic prompt caching honored end-to-end.
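client.py (drop-in)
from openai import OpenAI

# Same SDK as before; only the base URL changes.
# With no api_key argument, the SDK reads OPENAI_API_KEY from the environment.
client = OpenAI(
    base_url="https://gateway.langwatch.ai/v1",
    # Header sent with every request from this client.
    default_headers={"x-cost-center": "payments-eu"},
)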
routing · last 1h · 14,202 req · fallback rate 0.4%

Provider | Model | Share
openai | gpt-4o | 60%
anthropic | claude-sonnet | 30%
aws | bedrock/claude-haiku | 8%
google | vertex/gemini-pro | 2%
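
Provider fallback, as described for the gateway, amounts to trying providers in priority order and spilling over when one fails. The sketch below shows that control flow on the client side with stub providers. It is a simplified assumption of how such routing might work, not the gateway's implementation: `ProviderError`, `call_with_fallback`, and the stub functions are hypothetical, and latency-based spillover is left out.

```python
from typing import Callable, List, Tuple


class ProviderError(Exception):
    """Hypothetical error for a rate limit or upstream failure."""


Provider = Tuple[str, Callable[[str], str]]


def call_with_fallback(providers: List[Provider], prompt: str) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, reply) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Record the failure and spill over to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real model calls.
def rate_limited(prompt: str) -> str:
    raise ProviderError("429 rate limited")


def healthy(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = call_with_fallback(
    [("openai/gpt-4o", rate_limited), ("anthropic/claude-sonnet", healthy)],
    "hello",
)
print(used, "|", reply)  # anthropic/claude-sonnet | echo: hello
```

Doing this at the proxy rather than in each client is what lets the SDK call stay unchanged while the routing policy evolves.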
Customers

Trusted by teams shipping mission-critical AI.

(Names changed to protect the no-longer-innocent.)

"LangWatch became our single source of truth for agent quality. Regressions get caught in evals before they ever reach a quarterly earnings call."
Skeeter McGee
Head of AI Reliability · Enron
"We went from 'we hope it works' to a deploy gate backed by 800 evaluators. Product and engineering finally agree on what good means."
Hank Scorpio
CEO · Globex Corporation
"Scenario alone paid for the year. We replayed every release against a synthetic user set and caught a tool-routing bug 48 hours before rollout."
Patrick Bateman
VP, Mergers & Acquisitions · Pierce & Pierce
17.4B tokens traced / month
240k evals run / day
42 ms p95 latency
99.99% uptime SLA
Ready when you are

Ship agents with confidence.

Thirty minutes with a LangWatch solutions engineer, your stack, live, end to end.

No credit card · Cloud · VPC · Self-hosted · Local